\input macro[1,mps]
\magnification\magstephalf
\baselineskip 14pt
\leftline{\sevenrm A01.Tex[262,rwf]\today}
\leftline{\copyright\sevenrm Robert W. Floyd, January 18, 1989}
\leftline{\sevenrm Unpublished draft}
\bigskip
\noindent{\bf Decision Trees and Entropy}
\smallskip
A {\sl $b$-ary node\/} $n$ is a finite (possibly empty) string of
$b$-ary digits; i.e., a~$b$-ary number. The father of a node is the
node minus its last digit. The $i$-th~{\sl son\/} of $n$ is~$ni$.
The {\sl siblings\/} of a node are the other sons of the father. If $n↓1$
is a prefix of~$n↓2$, then $n↓1$ is an {\sl ancestor\/} of~$n↓2$;
$n↓2$~is a {\sl descendant\/} of~$n↓1$.
A $b$-ary tree is a set of $b$-ary nodes which is closed under the
father (and therefore ancestor) relation; if it contains a node,
it contains all prefixes. If it is closed under the sibling
relation, it is a {\sl full\/} $b$-ary tree. In a binary
(2-ary) tree, we call the 0-th son the {\sl left son\/},
and the 1-st son the {\sl right son\/}.
We can associate with each node $n$ of a $b$-ary tree the subinterval
of $[0,1)$ consisting of those real numbers whose $b$-ary representation
begins with~$n$. We call these $b$-ary intervals.
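\smallskip\noindent
For example (an illustrative case with $b=2$), the node $01$ corresponds to the
interval $[1/4,1/2)$, the set of reals whose binary expansions begin $.01$;
its sons $010$ and $011$ correspond to the two halves $[1/4,3/8)$ and $[3/8,1/2)$.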
\baselineskip0pt
$$\vcenter{\halign{\ctr{#}\xskip&\ctr{#}\xskip
&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip
&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\quad
&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip
&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip
&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip
&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip&\ctr{#}\xskip
&\ctr{#}\xskip&&\ctr{#}\xskip&\ctr{#}\cr
&&&&$\Lambda$&&&&&&&&&&&&&&&$\Lambda$\cr
$\bullet$&&&&&&&&$\bullet$
&$\bullet$&&&&&&&&&&&&&&&&&&$\bullet$\cr
\noalign{\bigskip}
&&0&&&&1&&&&&&0&&&&&&1&&&&&&2\cr
$\bullet$&&&&$\bullet$&&&&$\bullet$
&$\bullet$&&&&&&$\bullet$&&&&&&$\bullet$&&&&&&$\bullet$\cr
\noalign{\bigskip}
&00&&01&&10&&11&&&00&&01&&02&&10&&11&&12&&20&&21&&22\cr
$\bullet$&&$\bullet$&&$\bullet$&&$\bullet$&&$\bullet$
&$\bullet$&&$\bullet$&&$\bullet$&&$\bullet$&&$\bullet$
&&$\bullet$&&$\bullet$&&$\bullet$&&$\bullet$&&$\bullet$\cr}}$$
\baselineskip14pt
\noindent
The ancestor relation on nodes is the inclusion relation on intervals;
the left (right) relation on nodes is the left (right) relation on
intervals.
The leaves of a $b$-ary tree are {\sl prefix-free\/}; none of
them is the prefix of another (or, indeed, of any other node),
so the leaves correspond to disjoint intervals.
The {\sl depth\/} of a node in the tree is its length. The Kraft
inequality is $\sum b↑{-d↓i}≤1$, where $d↓i$ is the depth of
the $i$-th leaf, and $i$ ranges over all leaves; equality
holds just for full trees. The inequality follows from the fact
that $b↑{-d↓i}$ is the length of the interval associated with the
$i↑{\rm th}$~leaf, and the leaf intervals are disjoint [Thanks, DEK].
The {\sl subtree\/} rooted at $n$ is the set of nodes of which
$n$ is a prefix, with the prefix deleted. To prove the Kraft
inequality by induction, apply it to the subtrees rooted at
0, 1, $\ldots$, $b-1$, noting that the depth of a node in the tree
is one greater than its depth in the subtree.
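\smallskip\noindent
As a check (an illustrative binary case), the full tree with leaves $0$, $10$, $11$
gives $2↑{-1}+2↑{-2}+2↑{-2}=1$, while dropping the leaf $11$ leaves a tree whose
leaves $0$ and $10$ give only $2↑{-1}+2↑{-2}={3\over4}<1$, since the interval
$[3/4,1)$ corresponding to $11$ is no longer covered.
\smallskip\noindent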
A $b$-ary {\sl prefix code\/} is a set of $b$-ary strings satisfying
the prefix condition; this set, with its ancestors, is a $b$-ary
{\sl decision tree\/}.
If we assign probabilities $p↓i$ to the leaves of a tree, the
{\sl weighted path length\/} ({\sl wpl\/}) is $\sum p↓id↓i$.
It is the expected cost of reaching a leaf from the root, if $n↓i$
is visited with probability~$p↓i$ and the cost of going from father
to son is one. We may select a leaf by asking questions with $b$
possible answers, each question asking for one more digit of the node
until we have the complete node. The expected number of such
questions is clearly the~wpl, and to any strategy for
selecting an object by asking $b$-ary questions there corresponds a $b$-ary
tree whose wpl is the expected number of questions.
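\smallskip\noindent
For instance (an illustrative binary tree), if the leaves $0$, $10$, $110$, $111$
carry probabilities ${1\over2},{1\over4},{1\over8},{1\over8}$, the weighted path
length is
$${1\over2}\cdot1+{1\over4}\cdot2+{1\over8}\cdot3+{1\over8}\cdot3={7\over4}\,,$$
the expected number of binary questions needed to select a leaf.
\smallskip\noindent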
Suppose mutually exclusive events $e↓i$ occur with probabilities~$p↓i$;
we want, by asking $b$-ary questions, to determine which event has
occurred. If we make a test for which the $b$ outcomes are equally
likely, the events inconsistent with the test outcome now have
probability zero; those consistent with it have $b$~times greater
probability. Continue until only one event $e↓i$ survives. Its
probability, now~1, must have been multiplied by~$b$, exactly
$-\log↓bp↓i$ times. The average time to select the correct
event by such a $b$-section strategy, then, is
$H↓b(p)=-\sum p↓i\log↓bp↓i$, which we call the (base $b$) {\sl entropy\/}
of the distribution~$p$.
In the common special case $b=2$
we omit the subscript: $H(p)=-\sum p↓i\lg p↓i$. To show
we can't do better, place the events in any $b$-ary decision tree.
The expected time is then the~wpl, which we shall show is at
least $H↓b(p)$. On the other hand, we shall show there is
always a tree with ${\rm wpl}<H↓b(p)+1$.
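\smallskip\noindent
For the illustrative distribution $p=\langle{1\over2},{1\over4},{1\over8},
{1\over8}\rangle$ used above,
$$H(p)={1\over2}\lg 2+{1\over4}\lg 4+{1\over8}\lg 8+{1\over8}\lg 8={7\over4}\,,$$
matching the weighted path length of the tree with leaf depths $1,2,3,3$; equality
is possible there because each $p↓i$ is a power of~$b↑{-1}$.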
\smallskip\noindent
{\bf Theorem} (Shannon):
For any $p=\langle p↓1,p↓2,\ldots,p↓n\rangle$ there is
a $b$-ary tree with ${\rm wpl}<H↓b(p)+1$.
\smallskip\noindent
{\bf Proof:} Take the $p↓i$ in decreasing order. Assign each to a previously
unused $b$-ary interval $r↓i$ of the maximum $b$-ary size $|r↓i|≤p↓i$; the remaining
unused intervals will all be multiples of this size. The depth of $e↓i$
in this tree is $-\log↓b|r↓i|<-\log↓bp↓i+1$, so the wpl
is less than $\sum p↓i(-\log↓b p↓i+1)=H↓b(p)+1$.
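\smallskip\noindent
To illustrate the construction, take $b=2$ and the illustrative distribution
$p=\langle .4,.3,.2,.1\rangle$. The largest binary intervals not exceeding these
probabilities have lengths ${1\over4},{1\over4},{1\over8},{1\over16}$, and may be
placed at $[0,{1\over4})$, $[{1\over4},{1\over2})$, $[{1\over2},{5\over8})$,
$[{5\over8},{11\over16})$, giving depths $2,2,3,4$ and
$${\rm wpl}=.4\cdot2+.3\cdot2+.2\cdot3+.1\cdot4=2.4<H↓2(p)+1\approx 2.85\,.$$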
\smallskip\noindent
{\bf Theorem:} For any $p=\langle p↓1,p↓2,\ldots,p↓n\rangle$ there
is an order-preserving $b$-ary decision tree (one whose leaves appear in the
order $1,2,\ldots,n$ from left to right) with
${\rm wpl}<H↓b(p)+1+\log↓b2$.
\smallskip\noindent
{\bf Proof:} Let $r↓i$ be the interval $\bigl[\sum↓{j<i}p↓j\,,\;
\sum↓{j≤i}p↓j\bigr)$, of length~$p↓i$. Let $x$ be the length of
a longest $b$-ary interval included in~$r↓i$. Then $r↓i$ does not
contain the entirety of any $b$-ary interval of the next larger size,~$xb$,
so $p↓i<2xb$. Therefore $r↓i$ contains a
$b$-ary interval of length $>p↓i/(2b)$,
\baselineskip0pt
$$\vcenter{\halign{\ctr{#}$\;$&\ctr{#}\xskip%
&\ctr{#}$\;$\xskip&\ctr{#}$\;$\xskip&\ctr{#}$\;$\xskip&\ctr{#}\xskip$\;$%
&\ctr{#}$\;$\xskip&\ctr{#}$\;$\xskip&\ctr{#}\xskip&\ctr{#}$\;$&\ctr{#}\cr
&&&0&&&&1\cr
$\bullet$&&&&&$\bullet$&&&&&$\bullet$\cr
\noalign{\bigskip}
&&&&x\cr
\noalign{\medskip}
&&&\multispan2\downbracefill\cr
\noalign{\medskip}
&&00&&01&&10&&11\cr
$\bullet$&&&$\bullet$&&$\bullet$&&$\bullet$&&&$\bullet$\cr
\noalign{\medskip}
&\multispan8\upbracefill\cr
\noalign{\medskip}
&&&&&$r↓i$\cr}}$$
\baselineskip14pt
\noindent
which has tree depth $<-\log↓bp↓i+1+\log↓b2$; the resulting
${\rm wpl}<\sum p↓i(-\log↓bp↓i+1+\log↓b2)=H↓b(p)+1+\log↓b2$.
To see that this result can't be improved, let $\epsilon \ll 1$,
$b=2$, and let $p↓1=\epsilon$, $p↓2=1-2\epsilon$, $p↓3=\epsilon$: any
order-preserving binary tree puts the middle leaf at depth~2, so its wpl is
nearly~2, while $H↓2(p)+1+\log↓22$ exceeds~2 only slightly.
\bigskip\noindent
{\bf Lemma.} For all $r>0$, $\ln r≤r-1$, with equality only
for $r=1$. We call this the {\sl logarithmic inequality\/}; most inequalities
concerning entropy are derived from it.
\bigskip\noindent
{\bf Proof:}
Let $f(x) = x-1-\ln x$; $f(1) = 0$; $f'(x) = 1-1/x$, so $f'(x)$ goes from
negative to positive at $x=1$. Therefore $f(x) > 0$ for $x \not= 1$.
We use the logarithmic inequality to prove inequalities incorporating
logarithms, such as
$$\sum↓{i=1}↑nx↓i\log x↓i≥\biggl(\sum x↓i\biggr)\log\left(\,{\sum↓{i=1}↑n}x↓i
\over n\,\right)\,.\eqno(\ast)$$
Here is a paradigmatic application. We prove that, if
$f(n)=1+{a\over n}f(a)+{{n-a}\over n}f(n-a)$ ($0<a<n$),
$f(a)≥\lg a$, $f(n-a)≥\lg (n-a)$, then
$f(n)≥\lg n$, with possible equality only when $a=n/2$. That is, we want to
show $1+{a\over n}\lg a+{{n-a}\over n}\lg(n-a)≥\lg n$, with equality
only if $a=n/2$.
\smallskip\noindent
{\bf Step 1:} Put all terms on the downhill side of the inequality:
Show $\lg n-1-{a\over n}\lg a-{{n-a}\over n}\lg (n-a)≤0$.
\smallskip\noindent
{\bf Step 2:} Convert all logs to natural logs:
Show $\ln n-\ln 2-{a\over n}\ln a-{{n-a}\over n}\ln (n-a)≤0$.
\smallskip\noindent
{\bf Step 3:} Make all coefficients non-negative:
Show $\ln n+\ln{1\over 2}+{a\over n}\ln{1\over a}+{{n-a}\over n}\ln{1\over{n-a}}
≤0$.
\smallskip\noindent
{\bf Step 4:} Combine terms so that all logs are zero in the cases where
equality is obtained; in this case, where $a=n/2$:
Show ${a\over n}\left(\ln n+\ln{1\over 2}+\ln{1\over a}\right)+{{n-a}\over n}
\left(\ln n+\ln{1\over 2}+\ln{1\over{n-a}}\right)≤0$; that is,
show ${a\over n}\ln{n\over{2a}}+{{n-a}\over n}\ln{n\over{2(n-a)}}≤ 0$.
\smallskip\noindent
{\bf Step 5:} Apply the logarithmic inequality to all logarithms.
${a\over n}\ln{n\over 2a}+{n-a\over n}\ln {n\over 2(n-a)}≤{a\over n}
\bigl({n\over 2a}-1\bigr)+{n-a\over n}\bigl({n\over 2(n-a)}-1\bigr)
={1\over 2}-{a\over n}+{1\over 2}-{n-a\over n}=0$.
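\smallskip\noindent
As a numerical check (with the illustrative value $n=4$), the quantity
$1+{a\over n}\lg a+{{n-a}\over n}\lg(n-a)$ equals $1+{1\over2}+{1\over2}=2=\lg 4$
when $a=2$, the case of equality, and $1+0+{3\over4}\lg 3\approx 2.19>\lg 4$
when $a=1$.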
\vfill\eject
%\bigskip
\noindent
{\bf Exercises:}
\smallskip
\disleft 38pt:(1):
Prove $(\ast)$.
\smallskip
\disleft 38pt:(2):
Prove that if each $p↓{ij}>0$, $r↓i=\sum↓jp↓{ij}$, $s↓j=\sum↓ip↓{ij}$,
$t=\sum↓{ij}p↓{ij}$, then
$\sum↓{i,j}p↓{ij}\log p↓{ij}+t\log t≥\sum↓ir↓i\log r↓i+\sum↓js↓j\log s↓j$.
\smallskip
\disleft 38pt:(3):
Prove that if $f(1)=0$ and $f(n)=n+f(a↓n)+f(n-a↓n)$, where $0<a↓n<n$, then $f(n)≥n\lg n$.
\smallskip
\noindent{\bf Solution} (to exercise 2).
\disleft 38pt:(2):
Show $\sum r↓i\ln r↓i+\sum s↓j\ln s↓j-\sum p↓{ij}\ln p↓{ij}-t\ln t≤0$;
\disleft 38pt::
Show $\sum r↓i\ln r↓i+\sum s↓j\ln s↓j+\sum p↓{ij}\ln{1\over{p↓{ij}}}+t\ln{1\over t}
≤0$;
\disleft 38pt::
Show $\sum↓{ij}p↓{ij}\ln r↓i+\sum↓{ij}p↓{ij}\ln s↓j+\sum↓{ij}p↓{ij}\ln
{1\over{p↓{ij}}}+\sum↓{ij}p↓{ij}\ln{1\over t}≤0$;
\disleft 38pt::
$\sum↓{ij}p↓{ij}\ln\left({{r↓is↓j}\over{t\,p↓{ij}}}\right)
≤\sum p↓{ij}\left({{r↓is↓j}\over{t\,p↓{ij}}}
-1\right)=\sum{{r↓is↓j}\over t}-\sum p↓{ij}={{t↑2}\over t}-t=0$.
\bigskip\noindent
{\bf Theorem} (the {\sl entropy theorem\/}). The weighted path length
of a $b$-ary tree with leaf probabilities $p=\langle p↓1,p↓2,\ldots,p↓n\rangle$
is at least $H↓b(p)$, with equality only if for each $i$
the $i$-th path length~$d↓i$ is $-\log↓bp↓i$.
\bigskip\noindent
{\bf Proof:} The path lengths satisfy the Kraft inequality $\sum b↑{-d↓i}≤1$.
Then
$$\eqalignno{\ln b\bigl(H↓b(p)-\hbox{wpl}\bigr)&=\ln b\sum p↓i\bigl(
\log↓b(b↑{-d↓i})-\log↓bp↓i\bigr)\cr
&=\sum p↓i\ln(b↑{-d↓i}/p↓i)\cr
&≤\sum p↓i(b↑{-d↓i}/p↓i-1)\cr
\noalign{\hbox{(See how handy the logarithmic inequality is!)}}
&=\sum b↑{-d↓i}-\sum p↓i≤0\,,\cr}$$
so ${\rm wpl}≥H↓b(p)$, with equality if and only if for each~$i$,
$b↑{-d↓i}=p↓i$.
A useful reference for properties of the entropy function is D. S.~Jones,
{\sl Elementary Information Theory\/}, Clarendon, Oxford, 1979,
Chapter~2.
\bigskip\noindent
{\bf Theorem} (the {\sl generalized Kraft inequality, or GKI\/}).
Regard the length of the $j$-th edge out of each interior vertex in a
$b$-ary tree as $l↓j>0$. Let $w$ be the unique positive number such
that $\sum↓{j=1}↑b w↑{-l↓j}=1$. If $d↓i$ is the distance (sum of lengths
of edges) from the root to the $i$-th leaf in such a $b$-ary tree, then
$\sum w↑{-d↓i}≤1$.
\bigskip\noindent{\bf Proof:}
Associate intervals with nodes, but this time associate the interval
$[0,1)$ with the root, and if interval~$x$ is associated with node~$n$,
cut it into $b$~subintervals~$x↓j$ associated with~$nj$, where
$|x↓j|/|x|=w↑{-l↓j}$. Then the length of the interval associated
with a path of length~$d$ is~$w↑{-d}$. Since the leaf intervals
are disjoint, the interval lengths sum to at most~1,
that is, $\sum w↑{-d↓i}≤1$.
\bigskip
Another proof is by induction.
\smallskip
{\bf Basis:} One node. The root is a leaf, $d↓i=0$, $w↑0=1$.
\smallskip
{\bf Induction:} The root is $\Lambda$; the subtrees have roots at a subset
of $0,1,2,\ldots,b-1$. Path lengths from the root of the $j$-th subtree to its
leaves are $d↓i-l↓j$, and by the induction hypothesis satisfy
$\sum↓i w↑{-(d↓i-l↓j)}≤1$, so $w↑{l↓j}\sum↓i w↑{-d↓i}≤1$ and
$\sum↓i w↑{-d↓i}≤w↑{-l↓j}$, the sums running over the leaves of that subtree.
Summing over all subtrees, $\sum↓i w↑{-d↓i}≤\sum↓jw↑{-l↓j}=1$.
\bigskip\noindent
{\bf Theorem:} If path lengths and $w$ are defined as in the GKI,
for any $p=\langle p↓1,p↓2,\ldots,p↓n\rangle$ there is an
order-preserving $b$-ary decision tree with ${\rm wpl}<H↓w(p)
+{\rm constant}$, where the constant depends only on
$\langle l↓1,l↓2,\ldots,l↓b\rangle$.
\bigskip
Proof is analogous to the earlier special case.
\bigskip\noindent
{\bf Theorem} (the {\sl generalized entropy theorem, or GET\/}).
If path lengths are defined as in the generalized Kraft inequality,
the weighted path length satisfies ${\rm wpl}≥H↓w(p)$, with equality
only if the $i$-th path length $d↓i$ is $-\log↓wp↓i$.
\bigskip
Proof is analogous to the entropy theorem.
\bigskip
Let $q↓n$, for any node $n$, be the sum of the probabilities
of the leaves of the subtree rooted at~$n$. One shows by induction
that if equality holds in the GET, the path length from the root
to node~$n$ must be $-\log↓wq↓n$.
$$\vcenter{\halign{$\ctr{#}$\qq\qq&$\ctr{#}$&$\ctr{#}$\qq\q&$\ctr{#}$\cr
&n\cr
\noalign{\bigskip}
l↓0&&l↓1&l↓2\cr
\noalign{\medskip}
n0&n1&&n2\cr}}$$
\bigskip
\noindent
From this, it is readily seen that the relative probability of the $i$-th
path out of a node, $q↓{ni}/q↓n$, must be exactly $w↑{-l↓i}$. This suggests
a heuristic rule for efficient partitioning: choose each test so that
the probability of the $i$-th outcome is, as nearly as possible,~$w↑{-l↓i}$.
\bigskip\noindent
{\bf Example.} If true branches have path length 1 and false branches
have path length~2 in a binary tree, the value of $w$ in the generalized
Kraft inequality is determined by $w↑{-1}+w↑{-2}=1$, so $w=\phi$,
where $\phi$~is the golden ratio $\approx 1.618$. The GKI becomes
$\sum \phi↑{-d↓i}≤1$, and the wpl for probability distribution~$p$
is $≥H↓{\phi}(p)=H(p)\log 2/\log\phi$, with equality
only if $d↓i=-\log↓{\phi}p↓i=-\log(p↓i)/\log(\phi)$. At each test,
the probability of the true branch should be $\phi↑{-1}\approx0.618$, and that
of the false branch should be $\phi↑{-2}\approx0.382$.
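\smallskip\noindent
As an illustration (with an illustrative distribution chosen to make the bound
tight), consider three outcomes reached by: true (cost~1); false, then true
(cost~3); false, then false (cost~4). If their probabilities are $\phi↑{-1}$,
$\phi↑{-3}$, $\phi↑{-4}$ (which sum to~1, since $\phi↑{-3}+\phi↑{-4}=\phi↑{-2}$),
then each $d↓i=-\log↓{\phi}p↓i$ and
$${\rm wpl}=1\cdot\phi↑{-1}+3\cdot\phi↑{-3}+4\cdot\phi↑{-4}=H↓{\phi}(p)\approx 1.91\,.$$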
\bigskip\bigskip
\noindent
{\bf Algorithmic Entropy}
\smallskip
A problem has $N$ different possible results, which we will represent as the
numbers $1$ to $N$. Result $j$ has probability $q↓j$. The {\sl entropy} of
the problem is $H(q)=-\sum↓jq↓j\lg q↓j$. This is non-negative, and is 0 only if
all the $q↓j$ but one are 0, i.e., the problem is solved.
A test that can be made in a program has $M$ different outcomes, which we will
represent as the numbers from $1$ to $M$. Outcome $i$ has probability $r↓i$.
The {\sl information} of the test is $H(r)=-\sum r↓i\lg r↓i$. This is non-negative,
and is~0 only if all the $r↓i$ but one are~0, i.e.\ one outcome is certain. For a
given number of outcomes, information is maximized by $r↓i=1/M$; then
$H(r)=\lg~M$.
Given a problem and a test, let $p↓{ij}$ be the probability that the test outcome
is $i$ and the problem result is $j$. Then $q↓j=\sum↓ip↓{ij}$,
$r↓{i}=\sum↓jp↓{ij}$.
If we make the test and get outcome $i$, the conditional probability of result
$j$ is $p↓{ij}/\sum↓jp↓{ij}=p↓{ij}/r↓i$. This gives a new entropy
$-\sum↓j(p↓{ij}/r↓i)\lg ({p↓{ij}/r↓i})$. The expected value of the new entropy is
$$\eqalign{\sum↓ir↓i\left( -\sum↓j({p↓{ij}/r↓i})\lg ({p↓{ij}/r↓i})\right)
&=-\sum↓{i,j}p↓{ij}\lg({p↓{ij}/r↓i})\cr
&=-\sum↓{i,j}p↓{ij}\lg p↓{ij}+\sum↓{i,j}p↓{ij}\lg r↓i\cr
&=-\sum↓{i,j}p↓{ij}\lg p↓{ij}+\sum↓{i}r↓i\lg r↓i\cr
&=-\sum↓{i,j}p↓{ij}\lg p↓{ij}-H(r)\cr}$$
We can prove that
$-\sum↓{ij}p↓{ij}\lg p↓{ij}≥-\sum q↓j\lg q↓j,$
with equality only if for each $j$ there is a $k$ with $p↓{kj}=q↓j$, and
$p↓{ij}=0$ for $i≠k$. Then we conclude: {\bf A test, on the average,
reduces the problem entropy by at most its information. To achieve even this
much reduction the test outcome must be a function of the problem result.}
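\smallskip\noindent
For instance (an illustrative problem), suppose the $N=4$ results are equally
likely, so $H(q)=2$, and a binary test has outcome~1 exactly for results 1 and~2.
Then $H(r)=1$, the outcome is a function of the result, and the expected new
entropy is $2-1=1$: whichever outcome occurs, two equally likely results remain.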
\bigskip
Proof that $-\sum↓{ij}p↓{ij}\lg p↓{ij}≥-\sum q↓j\lg q↓j$. The inequality can be
shown for each $j$ separately. Let $\sum↓ip↓i=q$, $0≤p↓i$, $0<q≤1$ (if $q=0$ both
sides vanish). We want to show
$-\sum↓ip↓i\lg p↓i≥-q\lg q$, or $-\sum↓i{{p↓i}\over q}\lg p↓i≥-\lg q$, or
$-\sum↓i{{p↓i}\over q}\lg\left( {{p↓i}\over q}\right)≥0$.
Since $0≤p↓i/q≤1$, this is true term by term, with equality only if
$p↓i/q=0$, so that $p↓i=0$, or if $\lg (p↓i/q)=0$, so that $p↓i=q$.
\vfill\eject
\noindent{\bf Entropy of Searching and Sorting Problems}
\bigskip
If a system has equal likelihood of being in $n$ different states, we say
its {\it entropy} is $\lg n$. The expected time to determine which state
the system is in, by tests with two outcomes, cannot be less than the
entropy. A test which is true in $pn$ of the states $(0<p<1)$ reduces the
entropy, on the average, from $\lg(n)$ to $p \lg(pn) + q
\lg(q n)$, where $q = 1-p$. The change in entropy is
$-(p \lg p + q \lg q) ≤ 1$, with equality only
when $p = 1/2$. This change is the {\it information} gained by the test.
The graph of information {\it vs.} $p$ looks a bit like a parabola passing through
$(0,0)$, $(1/2,1)$ and $(1,0)$; however, it is vertical at 0 and 1, and has
a slightly larger area than the parabola ($1/(2\ln 2)$ {\it vs.}~$2/3$).
See next page for graph.
$$\vcenter{\halign{#\hfil\qquad%
&\hfil#\hfil\qquad%
&\hfil#\hfil\qquad%
&\hfil#\hfil\cr
\hfil$p$&$-p\ln p$&$H↓e(p)$&$H(p)$\cr
\noalign{\smallskip}
0.1&0.230&0.325&0.469\cr
0.2&0.322&0.500&0.722\cr
0.3&0.361&0.611&0.881\cr
0.4&0.367&0.673&0.971\cr
0.5&0.347&0.693&1.000\cr
0.6&0.306\cr
0.7&0.250\cr
0.8&0.179\cr
0.9&0.095\cr
\noalign{\smallskip}
0.01&0.046&0.056&0.081\cr
0.99&0.009\cr
0.02&0.078&0.098&0.141\cr
0.98&0.020\cr
0.05&0.150&0.199&0.286\cr
0.95&0.049\cr
0.15&0.285&0.423&0.610\cr
0.85&0.138\cr
0.25&0.347&0.562&0.811\cr
0.75&0.216\cr
}}$$
If a test has a variable probability, with distribution $f(p)\ (\int ↑1↓0
f(p)\,dp=1)$, the expected information from the test is $H(f) = - \int ↑1↓0 f(p)
(p\lg p + q \lg q)dp$. Often $f$ is a polynomial
$\sum ↓{i≥0} a↓i p↑i$; if so, the expected information is $\sum ↓{i≥0} a↓i b(i)$,
where $b(i)=- \int ↑1↓0 p↑i(p\lg p+ q \lg q)dp$.
Since the second part of this integral is messy for large $i$, we simplify the
problem using the fact that $H(f(p)) = H(f(1-p)) = H(g(p))$, where $g(p) =
.5(f(p) + f(1-p))$. For given $f$, $g$ is a distribution yielding the same
information and symmetric around $p=.5\,$.
\vfill\eject
\null\vfill\eject
From here on, we assume that $g(p)$ is symmetric around $p=1/2$, so
$$H(g) = - \int ↑1↓0 g(p) \cdot (p\lg p + q \lg q)dp = -2 \int ↑1↓0
g(p) \cdot p\lg p\, dp = - {2\over \ln 2} \int ↑1↓0 g(p) \cdot p\ln p\, dp.$$
When $f$ is a polynomial with the desired symmetry,
$\sum ↓{i≥0} a↓i p↑i$,
$$H(f) = - {2\over \ln 2}
\sum ↓{i=0} a↓i \int ↑1↓0 p↑{i+1} \ln p \, dp = {2\over \ln 2} \sum↓{i=0}
{a↓i\over(i+2)↑2}.$$
\noindent
{\bf Examples:}
\smallskip
\disleft 25pt:
(1): $f(p) = 1$; $H(f) = {1\over 2} {1\over \ln 2} = .7213$
\smallskip
\disleft 25pt:
(2): $f(p) = 6p(1-p)$; $H(f) =
6 \cdot {2\over \ln 2} \cdot \left({1\over 9} - {1\over 16}
\right) = {7\over 12} \cdot {1\over \ln 2} = .8416$
\smallskip
\disleft 25pt:
(3): $f(p) = 30p↑2(1-p)↑2$; $H(f) = 30 \cdot
{2\over \ln 2}\left({1\over 16}-{2\over 25}
+ {1\over 36}\right) = {37\over 60}\cdot{1\over \ln 2} = .8897$
\smallskip
\disleft 25pt:
(4): $f(p) = 140 p↑3(1-p)↑3$; $H(f)
= 140 \cdot {2\over \ln 2}\left({1\over 25} -
{3\over 36} + {3\over 49} - {1\over 64}\right) = {533\over 840} \cdot {1\over \ln 2} =
.9154$
\smallskip
Thus the information efficiency of Quicksort (the expected information gained per
comparison, out of a possible one bit) with sample sizes $1, 3, 5,$
and $7$ is respectively $.72, .84, .89,$ and $.92$.
In general, using the median of a sample of odd size $s$ results in information efficiency
$$\left({1\over 1\cdot 2} + {1\over 3\cdot 4} + {1\over 5\cdot 6}
+ \cdots + {1\over s(s+1)}\right)
\cdot {1\over \ln 2}.$$
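\smallskip\noindent
As a check against the examples above, $s=3$ gives
$\left({1\over 1\cdot 2}+{1\over 3\cdot 4}\right){1\over \ln 2}
={7\over 12}\cdot{1\over \ln 2}=.8416$, in agreement with example~(2), and $s=7$
gives ${533\over 840}\cdot{1\over \ln 2}=.9154$, in agreement with example~(4).
\smallskip\noindent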
If a system has probability $q↓i$ of being in the $i$th of $n$ states, its
entropy is $ - \sum ↑n↓{i=1} q↓i \lg q↓i$. The expected time to determine
which state the system is in, by any algorithm proceeding by two-outcome tests,
is no less than the entropy.
\smallskip\noindent
{\bf Example:} We are searching a table for a particular key. Its {\it a priori}
distribution is known to be normal with variance~$V$, so that
$$q↓i = \sqrt{1\over 2\pi V}\, e↑{-i↑2/(2V)}.$$
The entropy of this distribution is approximately
$$-\int ↑∞↓{-∞} \sqrt{1\over{2\pi V}}\,
e↑{-x↑2/(2V)} \lg \left(\sqrt{1\over 2\pi V}e↑{-x↑2/(2V)}\right) dx
= ?\, .$$
[RWF: Correct this.]
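\smallskip\noindent
One standard evaluation of this integral (offered tentatively, since the note
above flags the passage for correction): writing
$f(x)=\sqrt{1\over 2\pi V}\,e↑{-x↑2/(2V)}$, we have
$\lg f(x)=-{1\over2}\lg(2\pi V)-{x↑2\over 2V}\lg e$, and the second moment
of~$f$ is~$V$, so
$$-\int↑∞↓{-∞}f(x)\lg f(x)\,dx={1\over2}\lg(2\pi V)+{1\over2}\lg e
={1\over2}\lg(2\pi eV)\,.$$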
\bigskip
\noindent
{\bf An entropy problem.}
\bigskip
We have a large file on external memory to sort. It contains $n$ records,
and has original entropy $\lg(n!)$, since there are $n!$ different
permutations that might be required. External memory is organized in
{\sl pages} of size $p$ records, which can be brought in or out only as
units; fast memory, where all comparisons and computations are done, has
room for only a few pages. We therefore decide on this outline of an
algorithm. We maintain two page-sized output buffers and one input buffer
in fast memory. We read in data, then, by whatever means, put each record
into one of the output buffers. When an output buffer is full, we return
its page to external memory. The last time we refer to a page, we sort
it.
Each decision of which buffer to put a record in gains at most one bit of
information. The final within-page sorts gain information at most
$(n/p)\lg(p!)$, or about $n\lg p$. The total information gained must be
at least equal, on the average, to the entropy $\lg (n!)$, or about $n\lg
n$, of the problem, so the number of times a record is placed in a buffer
must be at least $n\lg n-n\lg p=n\lg (n/p)$; if we organize the algorithm
in passes, each of which sees each record once, the number of passes is at least
$n\lg(n/p)/n=\lg(n/p)$. Thus any algorithm we can program to work in
$\lg(n/p)$ passes is close to the maximum achievable efficiency.
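\smallskip\noindent
For example (with illustrative sizes), if $n=2↑{30}$ records are to be sorted with
pages of $p=2↑{10}$ records, any algorithm of the kind outlined needs at least
$\lg(n/p)=\lg 2↑{20}=20$ passes over the data.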
\bye